collate not support Unicode Variation Selector

  • Jump to comment-1
    n2029@ndensan.co.jp2022-08-02T23:54:31+00:00
    Hi, I hope supported Unicode Variation Selector on collate. D209007=# create table ivstesticu ( D209007(# moji text D209007(# ); D209007=# create table ivstest ( D209007(# moji text collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY D209007(# ); D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D'); D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D'); D209007=# select moji from ivstest where moji like '%' || U&'\+00304B' || '%'; ------------- 㐶󠄁き 㐶き (2 行) expected ------------- 㐶き (1 行) Best regards,
    • Jump to comment-1
      n2029@ndensan.co.jp2022-08-03T00:09:35+00:00
      Hi, I hope supported Unicode Variation Selector on collate. I will resend it because there was a typo. D209007=# create table ivstest ( moji text collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY ); D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D'); D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D'); D209007=# select moji from ivstest where moji like '%' || U&'\+003436' || '%'; ------------- 㐶󠄁き 㐶き (2 行) expected ------------- 㐶き (1 行) Best regards,
      • Jump to comment-1
        thomas.munro@gmail.com2022-08-03T00:41:51+00:00
        On Wed, Aug 3, 2022 at 12:09 PM 荒井元成 <n2029@ndensan.co.jp> wrote: > D209007=# create table ivstest ( moji text collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY ); > D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D'); > D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D'); > D209007=# select moji from ivstest where moji like '%' || U&'\+003436' || '%'; > ------------- > 㐶󠄁き > 㐶き > (2 行) > > expected > ------------- > 㐶き > (1 行) So you want to match only strings that contain U&'\+003436' *not* followed by a variation selector (as we also discussed at [1]). I'm pretty sure that everything in PostgreSQL considers variation selectors to be separate characters. Perhaps it is possible to write a regular expression covering the variation selector ranges, something like '\U00003436[^\U000E0100-\U000E010EF]'? Here's an example using Latin characters that are easier for me, but show approximately the same thing, since variation selectors are a bit like "combining" characters: postgres=# create table t (x text); CREATE TABLE postgres=# insert into t values ('e'), ('ef'), ('e' || U&'\0301'); INSERT 0 3 postgres=# select * from t; x ---- e ef é (3 rows) postgres=# select * from t where x ~ 'e([^\u0300-\u036f]|$)'; x ---- e ef (2 rows) [1] https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp
        • Jump to comment-1
          tgl@sss.pgh.pa.us2022-08-03T00:56:55+00:00
          Thomas Munro <thomas.munro@gmail.com> writes: > So you want to match only strings that contain U&'\+003436' *not* > followed by a variation selector (as we also discussed at [1]). I'm > pretty sure that everything in PostgreSQL considers variation > selectors to be separate characters. There might be something that doesn't, but LIKE certainly isn't it. I don't believe plain LIKE is collation-aware at all, it just sees characters to match or not match. ILIKE is a little collation-aware, but it's still not going to consider a combining sequence as one character. The same for the regex operators. Maybe it would help if you run the strings through normalize() first? I'm not sure if that can combine combining characters. regards, tom lane
          • Jump to comment-1
            thomas.munro@gmail.com2022-08-03T02:02:08+00:00
            On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe it would help if you run the strings through normalize() first? > I'm not sure if that can combine combining characters. I think the similarity between Latin combining characters and these ideographic variations might end there. I don't think there is a single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é. This system is for controlling small differences in rendering for the "same" character[1]. My computer doesn't even show the OP's example glyphs as different (to my eyes, at least; I can see on a random picture I found[2] that the one with the e0101 selector is supposed to have a ... what do you call that ... a tiny gap :-)). [1] http://www.unicode.org/reports/tr37/tr37-14.html [2] https://glyphwiki.org/wiki/u3436
            • Jump to comment-1
              horikyota.ntt@gmail.com2022-08-03T06:25:32+00:00
              At Wed, 3 Aug 2022 14:02:08 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in > On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Maybe it would help if you run the strings through normalize() first? > > I'm not sure if that can combine combining characters. > > I think the similarity between Latin combining characters and these > ideographic variations might end there. I don't think there is a > single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é. Right. At least in Japanese texts, the two "character"s are the same glyph. In that sense the loss of variation selectors from a text doesn't alter its meaning and doesn't hurt correctness at all. Ideographic variation is useful in special cases where their ideographic identity is crucial. > This system is for controlling small differences in rendering for the > "same" character[1]. My computer doesn't even show the OP's example > glyphs as different (to my eyes, at least; I can see on a random > picture I found[2] that the one with the e0101 selector is supposed to > have a ... what do you call that ... a tiny gap :-)). They need variation-aware fonts and application support to render. So when even *I* see the two characters on Excel (which I believe doesn't have that support by default), they would look exactly same. In that sense, my opinion on the behavior is that all ideographic variations rather should be treated as the same character in searching in general context. In other words, text matching should just drop variation selectors as the default behavior. ICU:Collator [1] has the notion of "collation strength" and I saw in an article that only Colator::IDENTICAL among five alternatives makes distinction between ideographic variations of a glyph. > [1] http://www.unicode.org/reports/tr37/tr37-14.html > [2] https://glyphwiki.org/wiki/u3436 [1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html regards. -- Kyotaro Horiguchi NTT Open Source Software Center
              • Jump to comment-1
                n2029@ndensan.co.jp2022-08-03T11:12:53+00:00
                Thank you for your reply. About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications. It should be able to handle all characters. regards. -----Original Message----- From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Sent: Wednesday, August 3, 2022 3:26 PM To: thomas.munro@gmail.com Cc: tgl@sss.pgh.pa.us; n2029@ndensan.co.jp; pgsql-hackers@lists.postgresql.org Subject: Re: collate not support Unicode Variation Selector At Wed, 3 Aug 2022 14:02:08 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in > On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Maybe it would help if you run the strings through normalize() first? > > I'm not sure if that can combine combining characters. > > I think the similarity between Latin combining characters and these > ideographic variations might end there. I don't think there is a > single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é. Right. At least in Japanese texts, the two "character"s are the same glyph. In that sense the loss of variation selectors from a text doesn't alter its meaning and doesn't hurt correctness at all. Ideographic variation is useful in special cases where their ideographic identity is crucial. > This system is for controlling small differences in rendering for the > "same" character[1]. My computer doesn't even show the OP's example > glyphs as different (to my eyes, at least; I can see on a random > picture I found[2] that the one with the e0101 selector is supposed to > have a ... what do you call that ... a tiny gap :-)). They need variation-aware fonts and application support to render. So when even *I* see the two characters on Excel (which I believe doesn't have that support by default), they would look exactly same. In that sense, my opinion on the behavior is that all ideographic variations rather should be treated as the same character in searching in general context. In other words, text matching should just drop variation selectors as the default behavior. ICU:Collator [1] has the notion of "collation strength" and I saw in an article that only Colator::IDENTICAL among five alternatives makes distinction between ideographic variations of a glyph. > [1] http://www.unicode.org/reports/tr37/tr37-14.html > [2] https://glyphwiki.org/wiki/u3436 [1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html regards. -- Kyotaro Horiguchi NTT Open Source Software Center
                • Jump to comment-1
                  horikyota.ntt@gmail.com2022-08-04T08:23:48+00:00
                  At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in > Thank you for your reply. > > About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications. > It should be able to handle all characters. Yeah, it is one of that fonts. But I didn't know that MS-Word can *display* ideographic variations. But it is dissapoinging that input requires to copy-paste from the Web.. Maybe that characters can be input smoothly by using ATOK or alikes.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
                  • Jump to comment-1
                    n2029@ndensan.co.jp2022-08-04T10:01:33+00:00
                    Thank you for your reply. SQLServer supports Unicode Variation Selector, so I would like PostgreSQL to support them as well. Regards. -- Motonari Arai -----Original Message----- From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Sent: Thursday, August 4, 2022 5:24 PM To: n2029@ndensan.co.jp Cc: thomas.munro@gmail.com; tgl@sss.pgh.pa.us; pgsql-hackers@lists.postgresql.org Subject: Re: collate not support Unicode Variation Selector At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in > Thank you for your reply. > > About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications. > It should be able to handle all characters. Yeah, it is one of that fonts. But I didn't know that MS-Word can *display* ideographic variations. But it is dissapoinging that input requires to copy-paste from the Web.. Maybe that characters can be input smoothly by using ATOK or alikes.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
                    • Jump to comment-1
                      horikyota.ntt@gmail.com2022-08-05T06:50:32+00:00
                      At Thu, 4 Aug 2022 19:01:33 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in > Thank you for your reply. > > SQLServer supports Unicode Variation Selector, so I would like PostgreSQL to > support them as well. I studied the code a bit further, then found that simple comparison can ignore selectors by using nondeterministic collation. CREATE COLLATION col1 (provider=icu, locale='ja', deterministic=false); SELECT (U&'\+003436' || U&'\+0E0101' || U&'\+00304D' collate col1) = U&'\+003436' || U&'\+00304D'; ?column? ---------- t However LIKE dislikes this. > ERROR: nondeterministic collations are not supported for LIKE Deterministic collations assume text equality means bytewise equality. So, the "problem" behavior is correct in a sense. In that sense, those functions that do not support nondeterministic collations can be implemented without considering ICU, which leads to the "problem" behavior. ICU has regular expression function so LIKE might be ableto be implemented using this. If it is done, and if a non-deterministic IVS-sensitive collation is available (I didin't find how to get one..), LIKE would work as you expect. But.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center